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Abstract 

This paper describes several new algorithms for estimating the parameters of a periodic bandlimited 
signal from samples corrupted by jitter (timing noise) and additive noise. Both classical (non-random) 
and Bayesian formulations are considered: an Expectation-Maximization (EM) algorithm is developed 
to compute the maximum likelihood (ML) estimator for the classical estimation framework, and two 
Gibbs samplers are proposed to approximate the Bayes least squares (BLS) estimate for parameters 
independently distributed according to a uniform prior. Simulations are performed to demonstrate the 
significant performance improvement achievable using these algorithms as compared to linear estimators. 
The ML estimator is also compared to the Cramer-Rao lower bound to determine the range of jitter for 
which the estimator is approximately efficient. These simulations provide evidence that the nonlinear 
algorithms derived here can tolerate L4-2 times more jitter than Hnear estimators, reducing on-chip 
ADC power consumption by 50-75 percent. 
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I. Introduction 

Analog-to-digital converters (ADCs) are affected by different types of uncertainty that introduce noise 
into the samples of a signal. Two important causes of this noise are inaccurate sample times and 
variation in the measured signal amplitude. Many well-known approaches exist for separating noise 
due to amplitude error from the signal. Similarly, noise in the sample times, known as jitter, is also a 
well-studied phenomenon. However, the effect of jitter is assumed to be negligible in practice because 
of the low phase noise sampling circuitry employed in conventional ADCs. 

A. Motivation 

The increasing demand for ultra low power ADCs has renewed interest in mitigating jitter using digital 
post-processing techniques. Because the digital portion of mixed-signal systems like digital sensors and 
wireless receivers continues to shrink, the analog portion, including the clock generator for the ADC, 
dominates the size and power consumption of the chip. Such systems would benefit greatly from smaller, 
more power-efficient analog circuitry; however, the use of such circuitry, with lower voltages, significantly 
increases noise, such as jitter in the clock signal. The power consumed by an ADC is proportional to the 
desired accuracy and sampling rate HI. In 111, the speed-power-accuracy tradeoff for high-rate ADCs is 
shown to satisfy 

Speed X (Accuracy (rms))^ 

— ^— ^ w constant. (1) 

Power 

Both 111 and El demonstrate that doubling the standard deviation of the jitter cuts down the effective 
number of bits (accuracy (rms) = 2™°^) by one, so the power consumption would need to increase by 
a factor of four to achieve the same accuracy as before. Clearly, the impact of jitter is not negligible. 
Incorporating a post-processing method that can recover some of this loss in accuracy should drastically 
reduce the power consumption of most high-speed ADCs. In addition, the ability to tolerate larger 
quantities of jitter allows the use of frequency-modulated clocks, which may yield additional benefits 
like lower electromagnetic interference and radiation f5|. 

B. Problem Formulation 

In this article, nonlinear algorithms for improved classical and Bayesian estimation are developed to 
recover the parameters of a bandlimited signal. We restrict our attention to signals with a finite basis, 
leaving the development of streaming algorithms for the infinite-dimensional case a path for further 
investigation. In particular, we consider a finite segment of a periodic bandlimited signal with period K; 
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the chosen basis is one of a periodic sine function psinc;^(t) = 



sin(7ri) 



and its integer shifts. One 



Ksin(irt/K) 



could also construct a suitable example by applying a smooth windowing function, rather than assuming 
periodicity. These algorithms are applicable for other signal classes, like splines. For this work, assume 
the Nyquist frequency is normalized to tt, and the signal x{t) is sampled with an oversampling factor of 
M. The nth sample, yn, is described by the observation model: 



Let the jitter Zn be zero-mean white Gaussian noise, with variance cr^; note that the jitter is normahzed 
relative to the critical sampling period. The dominant source of additive noise is assumed to be a signal- 
independent external noise source (not quantization noise); this additive noise, denoted by Wn, is also 
assumed to be zero-mean white Gaussian noise, with variance cr^; the additive noise is normalized to 
the scale of the parameters Xk- AH the noise sources are independent, and they are independent of the 
input signal parameters x^. Define [H(z)]„ ^ = psincj^(n/M + Zn — k); then. 



where y = [yo, . . . ,yN~-i], x = [xq, . . . ,xk-i], z = [zo, . . . , zn-i], and w = [wq, . . . ,wn-i]- To 
emphasize the generality of the model, derivations in the following sections are in terms of H(z). The 
psinc function is used in experiments only for the sake of example. 

To keep notation compact, denote by p(x), p{y; x), and p(y | x), the probabihty density function (pdf) 
of X, the pdf of y parameterized by the deterministic variable x, and the pdf of y conditioned on the 
random variable x, respectively. In this article, the random variable(s) associated with a pdf are given 
exphcitly only when necessary to avoid ambiguity. Denote the d-dimensional multivariate uniform and 
normal distributions by 




k=0 



(2) 



y = H(z)x + w, 



(3) 




(4) 



and 




1 



(5) 



Then, the likelihood function corresponding to the observation model is 



classical: p{y; x) 
Bayesian: p{y \ x) 




(6) 
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The classical estimation problem is to find the estimator x(y) that minimizes the mean squared error 
(MSE), knowing only the samples y and the likelihood function: 

x(y) = argminEY[||f(y) - x||^], Vx. (7) 
f{-) 

Computing this minimum MSE estimator (if such an estimator actually exists) is rarely straightforward; 
the maximum likelihood (ML) estimator is easier to approximate and will be used instead. The Bayesian 
estimation problem has the same objective, but with the additional knowledge of a prior on x: 

x(y) = argminEx,Y[||f(y) - x||^]. (8) 
f(-) 

The Bayes least squares (BLS) estimator, E[x | y], is the solution to this minimization problem. However, 
neither the posterior density p(x | y) nor its mean have closed forms for the observation model in dS]). 

C. Related Work 

The early literature focuses on linear reconstruction for signals affected by different types of random 
jitter: 10 develops a linear interpolation filter for jittered signals, and |i7J explores the effects of sampling 
and reconstruction jitter. The sampling error due to jitter is examined in @: using low pass interpolation, 
the MSE is approximately linear in o"^. In particular, when the jitter is Gaussian and small enough, and 
the input power spectral density Sxx{j^) = MSE is approximately ^Q'^^a^. 

However, these works provide limited insight into the effectiveness of general nonlinear techniques. 
More recently, [91 provides an iterative weighted least squares fitting algorithm for computing a cubic 
spline approximation to a jittered input signal. However, the weighted least squares fit uses only a second- 
order approximation of the input signal. More along the lines of the contributions herein, ifTOll constructs 
an Expectation-Maximization (EM) algorithm to estimate a signal sampled by time-interleaved ADCs 
with timing offsets; however, the offset for each ADC stays constant over time, unlike jitter. Also, a 
Metropolis-Hastings algorithm is derived in ifTTI to estimate each sample's timing error and the jitter 
variance from a sequence of samples; this approach is similar to the Gibbs sampler used in this work. 

Another related work, |[T2l . implements several reconstruction methods for finite-length discrete-time 
signals with sparse coefficients; however, since the jitter is constrained to integer locations to preserve the 
structure of the DFT, combinatorial methods are employed in lieu of more general statistical approaches. 
In the context of mitigating read-in and write-out jitter in data storage, a MAP-based estimator is proposed 
in |fT3l for jitter generated by a first-order Markov process and the bits corrupted by such jitter. 
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D. Outline 

In Section [ill background on Gauss-Hermite quadrature and Monte Carlo methods is provided. For 
the classical estimation problem, a numerical method is developed in Section |lll] to approximate the 
Cramer-Rao lower bound, and an Expectation-Maximization (EM) algorithm is presented to compute the 
ML estimate. Two Gibbs samplers are proposed in Section ITVl for the Bayesian case, one using rejection 
sampling, and the other using slice sampling. In Section |Vl simulation results are presented for all these 
algorithms, and conclusions and future directions for research are discussed in Section [Vll The problem 
proposed here, along with further background material, is also discussed in detail in |[T4l . 

II. Background 

The likelihood function described in ([6]) does not have a closed form, and neither does the posterior 
density. As a result, expressions for the ML estimator, the Cramer-Rao bound, and the BLS estimator all 
involve integrals or expectations that cannot be evaluated directly. More generally, consider the problem 
of computing E[/(x)]. Two approaches are considered: (a) numeric integration, via Gauss quadrature; 
and (b) stochastic approximation, via rejection, Gibbs, and slice sampling. 

A. Numerical Integration 

Gauss quadrature approximates the univariate expectation / f(x)tt;(x)cix by the sum X^iLi (^i), 
where Xj and Wi are fixed abscissas and weights based on the weighting function w{x), which is the pdf 
of X. The abscissas and weights can be precomputed for a choice of w{x) and n using the eigenvalue- 
based method in [15|. Quadrature generalizes to multiple input-multiple output functions f(x), but if 
the integrand is not separable, the computational complexity scales exponentially with the number of 
components in x. 

One weighting function of particular interest is the pdf of the standard normal distribution. The related 
method of quadrature is known as Gauss-Hermite quadrature, since the abscissas and weights are derived 
from Hermite polynomials. Adapting the abscissas and weights for any univariate normal distribution is 
easy: 

n— / f{x)Nx{li,a'^)dx^^Wif{(JXi + p), (9) 
V lira J -oo 

where Xi and Wi are the abscissas and weights for the standard Normal weighting function. As discussed 
in |fT6l, the error for Gauss-Hermite quadrature goes to zero exponentially fast as long as the nth derivative 
does not increase too fast with n; a sufficient condition for convergence is that /(x) does not increase 
more rapidly than e^^ . 
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Fig. 1. Markov chain of the Gibbs sampler. Gibbs sampling consists of repeatedly sampling from these transition distributions 
in turns until the chain converges to the steady-state. Since the steady-state distribution is the joint pdf p(x), further samples 
from the transition probabilities will be generated as if they were generated by the joint distribution itself. 

B. Stochastic Approximation 

Stochastic approximation reduces the problem of calculating an expectation E[/(x)] to sampling from 
the distribution p{x) using methods like rejection sampling, Gibbs sampling, and sUce sampling. 

1) Rejection Sampling: Rejection sampling generates samples from a proposal distribution q{x) that 
envelopes the target distribution p{x); i.e. there exists c > 1 such that cq{x) > p{x). Accepted samples 
have the same distribution as p{x). 

If the normalizing constant for a target distribution is unknown, rejection sampling is still effective. 
Suppose only the form p{x) is known, so P = J p{x) dx is unknown (but assumed to be finite). Then, 
we choose c such that cq{x) > p{x), and rejection sampling works as before. 

One disadvantage of rejection sampling is that if the proposal distribution is a poor approximation to 
the target distribution, the algorithm generally will reject a prohibitively large number of samples. The 
"envelope" requirement may make choosing a good proposal distribution very difficult. The expected 
number of iterations between accepted samples is c/P. The Gibbs and slice samplers overcome this 
issue by removing the enveloping requirement. 

2) Gibbs Sampling: The Gibbs sampler is a Markov chain Monte Carlo method introduced in |T7l|. 
The desired sampling distribution p{x) is the stationary distribution of the constructed Markov chain. 
Consider the joint probability density function p(xi, X2, • • • , xk), and define a chain of random variables 

~ p{xk I xi, . . . , Xfc+i . . . , Xk)- By augmenting these variables with the last K — 2 states to 
become x_2, . . . ,x_/^,x_i, the chain becomes a Markov chain; then, p(x) is the stationary distribution 
of this chain (see Figure [T]). 

Sufficient conditions for convergence of the underlying Markov chain are outUned in |[T8l and |[T9l . 
According to |20|, separating correlated variables can slow convergence. The period until the chain 
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Fig. 2. Slice sampling of p{x) illustrated: (a) Sampling is performed by traversing a Markov chain to approximate p{x), the 
stationary distribution. Each iteration consists of (b) uniformly choosing a slice {x : p{x) — y} and uniformly picking a new 
sample x from that slice f 141. 

converges is termed the "bum-in time." Once the chain has reached its steady-state, the generated samples 
can be treated as if they were generated from the joint distribution directly. 

3) Slice Sampling: Slice sampling is a Markov chain Monte Carlo method described in fTP\ for 
sampling from p{x) by sampling uniformly from the region under the pdf. Considering the area under 
the pdf as a pair of random variables {X,Y), where X is as before, and Y G {0,p{X)), with joint 
distribution p{x,y). As in Figure |2l the joint pdf can be sampled by repeatedly drawing from the 
conditional distributions, like in Gibbs sampling. These two conditional densities correspond exactly 
to uniform distributions, the first p{y \ x) being over a single interval, and the second p{x \ y) over the 
"slice." 

Sampling from the slice {x : p{x) > y*-'^^^} is generally difficult. If X is bounded, one could simply 
uniformly sample over the whole interval until the generated sample has sufficiently high probability p{x) 
to be in the slice. If the derivatives of p{x) are available, root-finding methods such as Newton's method 
or Halley's method may be used to locate boundaries of the slice. "Shrinkage," a simple accept-reject 
method to sample from the slice described in llTl . is used in this paper. 

III. Classical (Non-Random) Parameter Estimation 

Minimizing the MSE without access to p(x) is difficult because the optimal estimator must effectively 
minimize the MSE for all possible values of x. Moreover, the MSE of a candidate estimator generally 
depends on x, so it is difficult even to evaluate the performance of any given estimator directly. As a 
basis of comparison, the Cramer-Rao bound (CRB) can provide a lower bound on the MSE performance 
of unbiased estimators, given that we can evaluate this bound for the noisy observation model. 
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The best linear unbiased estimator (BLUE) is explored as a first attempt to attain the Cramer-Rao bound. 
To improve upon linear estimation, the ML estimator is derived, since it is guaranteed to asymptotically 
achieve the CRB as the oversampling factor, M, goes to infinity. While the likelihood function in Q is 
separable, it has no closed form: 



p{t,^)= n / ■^yn(h^(^n)x,cj2)Ar,„(0,cj2)(iz„ 

n=0 



(10) 



where hj^{zn) is the nth row vector of the matrix H(z). An EM algorithm is developed to simplify the 
problem of maximizing the likelihood function. 

A. Cramer-Rao Lower Bound 

The Cramer-Rao lower bound on the MSE for the class of unbiased estimators is equal to the trace of 
Iy(x)~^, where the Fisher information matrix Iy(x) can be expressed as 

Va/(x;y)\ /9Kx;y)\n ^[^^/(xsy)" 



Iy(x)=IE 



-E 



(11) 



V dx J \ dx 

and /(x; y) = lnp(y; x) is the log-likelihood of y as a function x. An efficient estimator, which satisfies 
the CRB, can be found for the the no-jitter (z = 0) case using the general formula in f22l|: 

Xeff|z=o(y) = (H(0)^H(0))-^H(0)^y. (12) 

However, for the random jitter case, the efficient estimator would depend on x, so it is not valid. 
For the random jitter case, the lack of a closed-form for the likelihood function requires that the lower 



bound be approximated numerically. Because the likelihood function is separable, 

N-l 

^(x;y) = lnp(y„;x), 

n=0 

which means that the Fisher information in (fTTI ) can be re-written as 



Iy(x) 



EE 

n=0 



dlnp{yn;x)\ /91np(y„;x 



(13) 



(14) 



dx J \ dx 

The singleton likelihood function p{yn', x) can be computed numerically using Gauss-Hermite quadrature: 



p(y„;x) w ^?i;i7Vy„(h^(zi)x,(j^). 



(15) 



i=l 



The derivative can also be approximated using quadrature: 

dp{yn;x) 
dx 



(16) 
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51np(y„;x) 1 dp{yn;x) 



(17) 



9x piVn,^) dx 

the expression inside the expectation in ([141) becomes a complicated non-linear function of y and x. 
Since p(y„;x) is approximately a Gaussian mixture, we can resort to simple stochastic approximation to 
approximate this expectation, and hence, the Fisher information matrix. For each data sample n, generate 
Ng samples yg from the Gaussian mixture J2i=i'Wi-^y0^n{^i)^:^w)- Then, 



N-l Ns 



lyW « E EFn(ys;x), (18) 



n=0 s=l 



where 



F„(ys;x) 

^ f Y.LiWiMy^{hl{zi)x,al){yg - hl{zi)x)hn{zi) \ 
V '^lEi=iWiMy^(hl{zi)x,al) J 

_ / T,i=i WjMjys] hn (^:»)x, (jDjys - \yl{z,)x)h.n{zi) \ 

\ (^l Ef=l WiMy^ {\vl{zi)x, 0-2 ) j 

The trace of the inverse of this matrix approximates Cramer-Rao lower bound. 



(19) 



B. Linear Estimation 

It is well known that for any linear observation model y = Hx + w with zero-mean additive noise 
and random matrix H, a linear estimator Ay is unbiased if and only if AE[H] = I. If E[H] is full 
column rank, one possible linear unbiased estimator for Q is the pseudoinverse 

XL(y) = E[H(z)]ty. (20) 

The question naturally arises as to how to find the best linear unbiased estimator (BLUE), in terms of 
minimizing the MSE. In |[22ll . the BLUE is derived for a fixed matrix H. As derived in |[T4l . the BLUE 
for a linear observation model with random matrix H is 

XBLUE(y) = (E[H(z)]^Ay"iE[H(z)])-iE[H(z)]^Ay-V, (21) 

where the covariance matrix of the data Ay depends on the value of the parameters: 

Ay = E[H(z)xx^H(z)^] - E[H(z)]xx^E[H(z)]^ + all. (22) 

A sufficient condition for (|2T]) to be valid is for Ay to be a scalar matrix (a scalar times the identity 
matrix). In general, and for the observation model and prior on the jitter used here, the diagonal elements 



December 14, 2009 



DRAFT 



10 



of Ay are not equal, and Ay is not a scalar matrix. For the no-jitter case, H(z) is deterministic, and the 
BLUE is equal to the efficient linear estimator in (fT2l) . 

This efficient linear estimator for the no-jitter case is used as a baseline to compare the performance 
of the ML estimator proposed in the next section. This estimator would be optimal in the MSB sense if 
the jitter were removed from the observation model. 

C. ML Estimation using the EM Algorithm 

Operating under the assumption that a linear estimator is too restrictive to capture the complicated 
behavior of the random jitter observation model, but unable to formulate an efficient estimator directly, 
maximum likelihood estimation appears to be a promising alternative. However, maximizing the likelihood 
function is a difficult task, because the likelihood function is not convex in x, does not have a closed form, 
and is not separable. One approach would be to utilize a gradient method, or a hill-climbing method, 
to find a local maximum iterating over one variable at a time. One related method, investigated in f23l, 
involves maximizing the joint likelihood p{y, z; x) by alternating maximizing p{z \ y; x) and p{y \ z; x). 

Maximum likelihood estimation is also a classic application of the Expectation-Maximization (EM) 
algorithm, as described in |[24l . For each iteration i, the EM algorithm consists of maximizing the 
expectation 



data. Repeated iterations of this step results in the maximizing x converging to a local maximum of the 
likelihood function. 

Maximizing (1231 ) will result in finding a local maximum of the likelihood function; however, nothing 
has been said about how fast the algorithm converges to a stationary point, or if this local maximum 
coincides with the true ML estimate. The rate of convergence depends on the choice of complete data |[24l . 
Proven in ll25l is the fact that the rate of convergence is worse if the CRB for the incomplete data set is 
much greater than the CRB for the augmented data set. The SEM algorithm in ||26l also approximates 
the observed Fisher information, which can be used to establish the quality of the estimate. Since the 
likelihood function is not concave in x, the estimate depends on initial conditions, the effects of which 
are studied in [ 14|. 

If the values of the jitter were known, the problem would decompose into a linear one, and the solution 



Q(x;x(*-i)) =E[logp(y,z;x) | y;x(^-i) 



(23) 



with respect to the unknown parameters x; i.e. x' 



argmaXx Q(x; x*^* ^)). Here, z is the missing 
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is readily available. Then, selecting the jitter to be the missing data, 



lnp(y,z;x) 



1 



2cr2 

'ill 



|y-H(z)x| 



2 L||7||2 

2 



iVlog(27^^7^^,C^2 



(24) 



Expanding and substituting into the expectation in (1231) yields 



(5(x;x 



-1 



2^^ 



X 



+x'^E 
1 



2(T? 



y^y-2y^E [h(z) | y;x(^-i) 
H(z)^H(z) I y;x(*-i)] x 



E 



z^z|y;x(*-i) 



(25) 



Since this equation is quadratic in x, the maximum value x*^*) satisfies the linear system of equations 

T 



E 



H(z)^H(z) |y;x 



X 



E 



H(z) I y;x 



.^(i-i) 



(26) 



The Hessian matrix is negative-definite, so the equation is strictly concave, and this maximum is unique. 
To solve this linear system of equations, the expectations in (l26l) need to be approximated. Using Bayes 
rule, 

p(z|y;x^ ) = 11 w— TT^ • (27) 



If the jitter or additive noise were not independent (white), the expectation would not be separable. Using 
Gauss-Hermite quadrature (with / terms) on the left and right expectations, 

E[H(zfH(z)|y;x(^-i) 



N-l 



Y,^\^n{Zn)K{Zn)\yn-M'- 



1) 



n=0 

N-l 



n=0 Pyyn^^^ ') j=i 
[e[h(z) |y;x(^^i> 



(28) 



E 



Ki^n) I yn;x 



1 I 



(29) 



and p(y„;x(*^^)) is computed again using (fTSl) . The complexity of performing these approximations is 
linear in the number of samples, so for a large number of parameters, the limiting step is actually solving 
the system once the expectations are computed. 
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IV. Bayesian Estimation 

The classical estimation framework is suitable for designing a general post-processor for a sampling 
system when nothing is known about the process generating the parameters of the input signal. How- 
ever, if the generating process is known, Bayesian inference leverages that information to optimize the 
performance of the post-processor. The Bayes least squares (BLS) estimator minimizes the MSE, and 
approximating that estimator is the objective of this section. 

How should the prior distribution be chosen? A white Gaussian process is attractive for its simplicity. 
The conjugate prior is another prior employed for convenience. However, since the likelihood function 
is already intractable, a conjugate prior is unhelpful for this problem. The "least-informative" prior is 
another suitable choice, but in general, it is hard to find. The approach used here is the maximum entropy 
prior. The maximum entropy model is justified in that it imposes minimal structure on the parameters; in 
this way, the maximum entropy model is the most "challenging" prior to use. Therefore, the parameters 
should be independent, and the entropy-maximizing distribution over a finite interval is the uniform 
prior |27|. For simplicity, we will assume that the parameters lie between —1 and 1. 

Like the likelihood function in the previous section, the resulting posterior density has no closed form. 
It can be expressed in terms of the singleton likelihoods in ([TOl l: 

nl-o' pivn I X') nf=o' pK) ■ 

A. LLS Estimation 



Pi- 1 y) = . .!:^°J^7 7\\r^i ^ . (30) 



The linear least squares (LLS) estimator is defined to be the linear estimator with minimum MSE. For 
the random jitter observation model in Q, the LLS estimator is 

XLLs(y) = E[H(z)]^ (^E[H(z)H(z)^] + ^ij y. (31) 

The expectations in (OTI ) can be computed using Gauss-Hermite quadrature or some other numerical 
method. However, the optimal linear estimator has already been studied and does not show much 
improvement over the LLS estimator assuming no jitter is present. In [6], the optimal filter coefficients 
for a low pass interpolator are derived for the random jitter problem. As mentioned in the introduction, 
the MSE of a low pass interpolator is linear in the variance of the jitter |[8l . 
When no jitter is assumed, the above estimator simplifies to 

t2 



XLLSlz=o(y) = H(0)^ ( H(0)H(0)^ + ) y. (32) 
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This linear estimator is the best Unear operation that can be performed in the absence of jitter. Hence, 
the no-jitter LLS estimator is the baseUne estimator against which the BLS estimators derived later will 
be measured. The error covariance of this estimator is 

Alls|z=o = 4(^- mf (h(O)H(O)^ + 

2 2 
'^x T T 




(33) 



l + Mal/al M + al/al ' 

B. BLS Estimation using Gibbs Sampling 

The BLS estimator is well-known to have the minimum MSB of any estimator for the Bayesian 
framework. However, optimal performance comes at a price: the BLS estimator involves taking the 
expected value of the posterior distribution, which is generally difficult to compute, as it is for this 
problem. However, stochastic approximation can be helpful here. Consider the BLS estimator for (x, z). 

The expectation E[x, z | y] is an obvious application of Gibbs sampling. Sampling from p(x, z | y) 
directly is a terrible idea due to the high dimensionality of the sample space, even if it were trivial to 
generate samples. Instead, Gibbs samphng can be used to generate samples of one variable at a time, 
conditioned on the rest. 

Once enough samples have been taken so that the Markov chain is sufficiently close to the stationary dis- 
tribution, additional samples are averaged to approximate the BLS estimator. In the description below, 
represents the "bum-in time", the number of iterations until the Markov chain has reached its steady state, 
and / represents the number of samples to generate after convergence has been achieved. 

Require: y,I,Ib 
2,(0) = 0, x(0) = 
for i = 1 : J + Jb do 



4 - p(-i4^4^-^\...,4:V,x(-i),y) 



~ M-|z«,4^-'\...,4~\\y) 

~ p(.|zW,x«,4-^\...,4lV,y) 

Xk-1 ~ P{- I z^*^ ,Xo\..., x^^_2, y) 
end for 

return x, z 
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Consider generating samples z„ from the distribution p{- \ z_„,x,y). Using Bayes rule, and the 
independence of Zn and Wn, 

p(z_„,x, y) 

oc My^ (h^(^n)x, ^2 )A4„ (0, ct2). (34) 
The full-conditional distribution of z„ is independent of z_„, so sampling the jitter values can be easily 
grouped together. Because this functional form is enveloped by the prior on z„, rejection sampling is an 
obvious choice to produce samples. The proposal density is q(zn) = Mz„(0,(j'^), and the scaling factor 
is c = 1/V27rcr2. In fact, since h^(2;„)x < ||h„(z„)||2||x||2 = (l)||x||2, 

1 



< 



1 



exp 



: exp 



yl - 2y„h^(z„)x + (h^(z„)x)2 



2al 



2yn||x||2 



2al 



^/?^-2^/,^||x|| 

2crl 



(35) 

and cq{zn) would more tightly 



and if y„ > 2j/„||x||2, c can be multiplied by exp 
envelope (l34l) . 

Unlike the independent z^s, the conditional distribution on Xk does depend on the other parameters 

x-fc: 

p(y I z,x)p(z)p(x) 



p{xk I x_fc,z,y) 



P(x--fc,z,y) 
ocAAy(H(z)x,cT^I)[/,,(-l,l). 



(36) 



Denote by Hfc(z) and H_fc(z) the columns of H(z) that multiply and x_fc in (|36l ). respectively. The 
truncated normal distribution in (l36l ) can be modified to be a distribution explicitly on Xk'. 



where 



Hfc(z)^(y-H_fe(z)x_fc) 



H,(z)^H,(z) ' H,(z)^Hfc(z)- 
Sampling from a truncated normal distribution is a typical application of cdf inversion. When the mean 
lies between —1 and 1, the cdf is almost linear, and the cdf inversion technique works well. However, 
when the mean lies far outside this range, the cdf is practically flat, and the technique is limited by the 
precision of the computer. A rejection sampling-based approach for producing samples of a truncated 
standard normal distribution is developed in ll28l . For a standard normal distribution truncated to the 



(37) 



(38) 
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Fig. 3. Example plot of the unnormalized pdf f{zo) — p{zo j y,x), proportional to p{zo \ x, y), with respect to zo, for 
K = 10, (Tiu — 0.01, az — 0.5, and M — 16. The proposal distribution for rejection sampling (a zero-mean normal distribution 
with variance a^) is also shown for reference. Note that even for values of zo likely according to the proposal density q{zo), the 
target density is many orders of magnitude smaller, so rejection sampling will reject an extraordinarily large number of samples 
before accepting a sample. 



interval [a,b], where a > 0, the proposal distribution should be an exponential distribution shifted to 
cover X > a, whose optimal scale factor 

a- = ^±^. (39) 

The case where [a, b] lies to the left of zero, generating samples is the mirror-image of this problem, so 
the optimal exponential distribution is flipped and shifted to cover x < b, and a* = -1(6 - V^^ + 4). 
According to [28], a uniform distribution Ux{a,b) is the preferred proposal distribution when a and b 
are too close together, so that a disproportionately large number of samples are not rejected due to the 
X < b or X > a requirement. 

The parameters Xk are all highly correlated, so one might wonder if it is possible to use a similar 
technique to generate realizations of the multivariate truncated normal distribution p(x | y, z) directly. 
However, as |[28l laments, the rejection-sampling method has no natural extension to multivariate truncated 
normal distributions when the variables are correlated. Gibbs sampling is recommended to reduce the 
problem down to univariate sampling, which is what is proposed here anyway. Sampling Xk one at a 
time has the added benefit of extending naturally for a less trivial generating process for x. 



December 14, 2009 



DRAFT 



16 



C. Improvement via Slice Sampling 

While the rejection sampler provides an exact method for producing realizations of a target distribution, 
a significant disparity between the shape of the proposal and target distributions causes a large number of 
samples to be rejected. In the case of rejection sampling employed in the Gibbs sampler to generate jitter 
values Zn from the density (|34l ). finding the smallest value for c is difficult, and the shape of the target 
distribution can vary depending on the parameters. Empirical evidence, shown in Figure [3l portrays the 
extent of the problem with rejection sampling, when cr^ a^- High oversampling only compounds this 
phenomenon. 

Slice sampling is not susceptible to such problems since no tightly enveloping proposal density is 
necessary; the ability to evaluate an unnormalized form of the target distribution is sufficient. Thus, the 
expression in ( [34l ) can still be used. Each iteration of slice sampling consists of two uniform sampling 
problems: 

1) Choose a slice u uniformly from [0,p(zi*^ | y,x)]. 

2) Sample zi*^^^ uniformly from the slice S = {zn '■ p{zn | y, x) > u}. 

The first step is trivial, since we are sampling from a single interval. The second step is more difficult. 
However, since u < p{zn \ y, x) for all z„ in the slice, 

\ogu < 2^2 ~ \og{2'n(j^cjyj) 

z^ 

< -^-log(2w.cT^). (40) 
Solving for Zn, the range of possible z„ is bounded: 



\zn\ <az\l-2\ogu-2\og{2T:ayjaz). (41) 

Using these extreme points for the initial interval containing the slice, and the "shrinkage" method 
specified in lf2Tl to sample from the slice by repeatedly shrinking the interval, slice sampling becomes 
a relatively efficient alternative to rejection sampling. The "shrinkage" method decreases the size of the 
interval exponentially fast, on average. Consider one iteration of shrinkage, where the previous point xq 
lies in the interval [L,R\. Then, the expected size of the new interval [V ,R'] is 

1 



¥.[R! -L' I i?,L,xo] 



rxo i-R 

/ {R — x) dx + {x — L) dx 

J L Jxo 



R-L 

_R^-2RL + L'^ xo{R + L - xo) - RL 

2{R- L) ^ R-L 
_ R-L xojR + L-xo)- RL 

" 2 + R-L ■ ^ ^ 
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This expectation is quadratic in xq, so the maximum occurs at the extreme point xq = {R + L)/2. The 
maximum value is 

maxE[i?' — L' I i?, L, xq\ 

_R-L ^{{R + L)/2){R + L-{R + L)/2) - RL 

.4^+(i^±^!Zizi?£.f,„_,,, (43) 

Concavity implies that the minima are at the two endpoints xq = L and xq = R. In both cases, the 
expected size of the interval is {R — L)/2. Therefore, 

^{R-L)<E[R' -L' I R,L,xo] < ^iR-L), (44) 

which implies that at worst, the size of the interval shrinks to 3/4 its previous size per iteration, on 
average. Then, given the initial interval [Lq, Rq] and previous point xq, the expected size of the interval 
[Lj,Rj] after I iterations of the shrinkage algorithm is 

E[Ri - Li I Ro,Lo,xo] 

= E[E[i?/ - Lj I Ro,Lo, . . . ,i?/_i,L/_i,xo] | Ro,Lo,xo] 
3^' 



< ^-j {Ro - Lo). (45) 

If the target distribution p{x) is continuous, the algorithm is guaranteed to terminate once the search 
interval is small enough. Since the interval size shrinks exponentially fast, on average, the number of 
"shrinkage" iterations is approximately proportional to the log of the fraction of the initial interval 
contained in the slice. In a discussion included in ETI . a binary search-like shrinkage algorithm is 
proposed that can converge faster on the slice than the algorithm used here. Incorporating such an 
approach to accelerate the slice sampler merits future investigation. 

Using slice sampling to generate the jitter values when is large relative to improves the speed 
of the Gibbs sampler. However, the addition of new auxiliary variables through slice sampling can be 
expected to slow the Gibbs sampler's overall rate of convergence. Thus, both algorithms are included for 
simulation. 



V. Simulation Results 

The objectives of this section are: (a) to determine when jitter significantly affects the samples of a 
signal and (b) to demonstrate that nonlinear post-processing can improve upon linear estimation of the 
signal parameters. The convergence behavior of all the included algorithms, as well as their sensitivity to 
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initial conditions, is presented in |[T4l . This article focuses on the performance analysis of these algorithms, 
using Monte Carlo simulation to determine the MSE of each algorithm for the different parameter choices 
(specified in each figure). As in the rest of the paper, cj^ is relative to the critical sampling period T = 1, 
and a^} is relative to the scale of the signal parameters Xk ~ Ux{—l,l)- 

A. Evaluating the EM Algorithm 

1) Convergence: The convergence properties of the EM algorithm were discussed briefly in the 
background section. Intuitively, increasing M, increasing cr^, or decreasing increases the additional 
information from knowing the missing data z, slowing the rate of convergence. This behavior is observed 
in the first simulations in |[T4l . In addition, these simulations suggest that the proposed algorithm converges 
exponentially fast. 

The other major sticking point for using an EM algorithm is the local nature of the solution. If the 
likelihood function has many local maxima, the EM algorithm may not necessarily converge to the global 
maximum. Instead, the local maximum is determined by the choice of initial conditions; i.e. the initial 
value of the parameters x. Simulations in |[T4l suggest the sensitivity to initial conditions increases with 
the non-concavity of the likelihood function. Larger M, larger a^, or smaller increase the rate and 
severity of oscillations in the likelihood function, precipitating more local maxima. In cases when the 
sensitivity is an issue, using multiple random starting points or approaches like the deterministic annealing 
EM algorithm |[29l can help alleviate the problem. 

2) MSE Performance: The EM algorithm is compared against two linear estimators. First, to demon- 
strate the MSE improvement attainable by not restricting ourselves to linear estimators, the EM algorithm 
is pitted against the linear unbiased estimator in ( [20l ). However, a major motivating factor for developing 
these algorithms is to reduce the power consumption due to clock accuracy. By comparing the EM 
algorithm against the optimal linear estimator for the no-jitter observation model, the EM algorithm 
can be shown to achieve the same MSE as the linear estimator for a substantially larger jitter variance, 
reducing the clock's power consumption. 

When the additive noise dominates the jitter (az ^ cr^), the improvement can be expected to be 
minimal, since the system is nearly linear, and the jitter is statistically insignificant. As the amount of 
jitter increases, the density function p{z \ y; x) used in each iteration of the EM algorithm becomes more 
non-linear in z, and the quadrature becomes less accurate for a given number of terms. Therefore, the EM 
algorithm generally takes longer to converge, and the result should be a less accurate approximation to 
the true ML estimator. This behavior can be observed in Figure |4l where the EM algorithm is compared 
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Fig. 4. EM Algorithm Performance: Tliis plot compares the EM algorithm to the linear unbiased estimator in ( 120b and the 
no-jitter efficient linear estimator in ( 112b for K — 10, M = 16 and = 0.05. More extensive plots can be found in 1141 . 



against both the linear unbiased estimator in ( [20l l and the efficient no-jitter linear estimator in ([T2l ). The 
linear unbiased estimator has lower MSE for higher M, and the EM algorithm generally has lower MSE 
than either linear estimator. 

To answer the question of how much more jitter can be tolerated for the same desired MSE using the 
EM algorithm, the maximum proportional increase is plotted as a function of M and aw in Figure \5\ The 
maximum proportional increase for a choice of M and is computed by approximating log-log domain 
MSE curves, like those in Figure IH with piece- wise linear curves and interpolating the maximum distance 
between them over the range of az up to 0.5. The proportion of improvement increases logarithmically 
as M increases, and the improvement stays approximately the same for different values of cr^. However, 
the maximum improvement shown for = 0.5 is low because no az > 0.5 is tested. The maximum 
improvement factor shown corresponds to the power consumption dropping by 50 to 75 percent. 

As a final experiment, the Cramer-Rao lower bound is compared to the unbiased linear estimator (l20l) 
and EM algorithm to measure the efficiency of the algorithms. Although computational difficulties prevent 
a complete comparison for every possible value of x, carrying out a comparison for a few randomly chosen 
values of x provide a measure of the quality of the algorithms. As the curves in Figure [6] demonstrate 
for one such random choice of x, both algorithms are approximately efficient for small az, but the EM 
algorithm continues to be efficient for larger values of az than the linear estimator. Since the ML estimator 
is asymptotically efficient, as M —>■ cxd, the EM algorithm should be efficient for all az with large enough 
oversamphng. 
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0.025 0.05 0.1 0.25 
Additive noise std. deviation fa ) 

(b) 



Fig. 5. EM Algorithm Performance: These graphs show the maximum factor of improvement in jitter tolerance, measured by 
(Tz, achievable by the EM algorithm. Holding = 0.1 fixed, (a) shows the trend in maximum improvement as M increases, 
and (b) shows the trend in maximum improvement as o-u, increases holding M = 8 fixed. 




Fig. 6. EM Algorithm Performance: For a randomly chosen x (K — 10), the Cramer-Rao lower bound is plotted for various 
choices of o-^, and M = 16 and = 0.05. Then, the MSE of 100 trials of both the linear unbiased estimator and the EM 
algorithm are plotted against the CRB. Similar curves for M = 2 and M = 4 can be found in 1141 . 
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Fig. 7. Gibbs Sampler Performance: This plot compares both Gibbs sampler algorithms to the random-jitter and no-jitter LLS 
estimators (in Oil ) and i32i . respectively). All four MSB curves are plotted for K — 10 parameters, M — 16, and (t„ = 0.05. 
Additional data can be found in 1141 . 



B. The Gibbs Samplers 

1) Convergence: The Gibbs sampler is known to converge if the Markov chain is irreducible and 
aperiodic; however, the rate of convergence needs to be established. These results can be used to tune 
the "bum-in" time If, and the number of intervals / after the chain has sufficiently converged. In [141, 
the convergence rates of the Gibbs sampler and Gibbs sampler using slice sampling are simulated to 
determine trends in M, cr^, and a^- The essential result shown is that the convergence rate increases 
when jitter or additive noise power decrease, or when the oversampling factor increases. 

2 ) MSE Performance: The MSE performance of both Gibbs samplers is determined for a wide range 
of oversampling factors M, jitter variance a^, and additive noise variance cr^. Like the EM algorithm, the 
Gibbs samplers are expected to outperform both linear estimators when the jitter dominates the additive 
noise {az ^ cr^) and when the jitter is not so large that the Gibbs sampling methods become inaccurate 
for the given number of samples. The Gibbs sampler using slice sampling is also expected to be more 
accurate for large az, since rejection sampling fails (the acceptance probability becomes too small). 
Figure |7] displays all these expected behaviors, for the example of M = 16 and cr^ = 0.05. 

To summarize the performance of the Gibbs samplers, the maximum factor of improvement in az 
attainable by using these sampling algorithms is plotted against increasing M and aw in Figure [H These 
plots portray the same trends as observed for the EM algorithm; however, these algorithms appear to 
handle the high cr^ case better than the EM algorithm does. 



December 14, 2009 



DRAFT 



22 




Fig. 8. Gibbs Sampler Performance: These graphs show the maximum factor of improvement in jitter tolerance, measured by 
(Tz, achievable by both Gibbs samplers. Holding — 0.1 fixed, (a) shows the trend in maximum improvement as AI increases, 
and (b) shows the trend in maximum improvement as increases holding M = 8 fixed. 



VI. Conclusion 

The results of the previous section are very encouraging from a power-consumption standpoint. A 
maximum improvement of between 1.4 to 2 times the jitter translates to a two-to-fourfold decrease in 
power consumption, according to ([T]). To put the magnitude of such an improvement in context, consider 
the digital baseband processor for ultra-wideband communication in |[30l . This processor incorporates an 
ADC and a PLL, which consume 86 mW and 45 mW, respectively, out of a 271 mW budget for the chip. 
Reducing by a factor of two the power consumed by the ADC alone would decrease the total power 
consumption of the chip by almost sixteen percent. 

While effective, both the EM algorithm and the Gibbs samplers are computationally expensive. One 
benefit of digital post-processing is that these algorithms can be performed off-chip, on a computer or 
other system with less limited computational resources. For real-time on-chip applications, Kalman-filter- 
like versions of the EM algorithm or Gibbs sampler would be more practical; this extension is a topic for 
further investigation. Related to real-time processing is developing streaming algorithms for the infinite- 
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dimensional case, extending this work for general real-time sampling systems. Another future direction 
involves modifying these algorithms for correlated or periodic jitter. 

Sampling jitter mitigation is actually just one application of these new algorithms. In the frequency 
domain, jitter maps to uncertainty in frequency; using algorithms such as these should produce more 
rehable Fourier transforms for systems like spectrum analyzers. In higher dimensions, tuning noise 
becomes location jitter in images or video. Greater tolerance of the locations of pixels in images would 
allow camera manufacturers to place smaller pixels closer together, enabling higher quahty images to be 
acquired using conventional photosensor technology. This paper shows that significant improvements over 
the best linear post-processing are possible; thus, further work may impact these and other apphcations. 
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